table: native scan and Arrow stream integration by abnobdoss · Pull Request #2 · abnobdoss/iceberg-python

abnobdoss · 2026-05-25T00:56:10Z

Status: active integration branch

Scope:

Env-gated native Arrow batch-reader path for PyIceberg scans.
Table-property and env mode selector for PyArrow, Rust read, and Rust plan-and-read.
Eager to_arrow materializes through the same guarded native batch reader when a native mode is enabled.
Arrow PyCapsule stream support for append, overwrite, Table, DataScan, to_polars, and to_duckdb.
Native equality-delete payload conversion at the adapter boundary.
Opt-in native upsert match-filter route for benchmarking.
Reproducible benchmark harness that records wall time, CPU time, RSS, rows, checksum, and fallback status.

Current value proven on the rebased branch:

Local 1M rows across 16 files: Rust plan-and-read is correctness parity but not a speed win. Full, projection, and filter are about 0.85x to 0.93x PyArrow wall time and use more RSS.
Local planning-heavy 128-file scans: Rust plan-and-read is about 1.32x faster for projection and 1.53x faster for filter, with lower CPU but much higher RSS.
Local planning-heavy 512-file spot check: Rust plan-and-read is about 1.49x faster for projection and 1.52x faster for filter, again with much higher RSS.
PyCapsule PyArrow handoff is effectively parity with normal PyArrow in this harness; it improves interop but is not a scan-planning speedup by itself.

Verification:

Focused pytest suite passed for Arrow capsule, scan backend selection, pyiceberg-core adapters, and native upsert route.
Commit hooks passed: trailing whitespace, EOF, AST, Ruff, Ruff format, mypy, pydocstyle, codespell.
Local fork pyiceberg-core release build imports the required scan, expression, schema, and file_io modules.

Known limits:

Published pyiceberg-core package does not expose the required native scan modules; reviewers need a local fork wheel until upstream packaging catches up.
Equality-delete support is adapter-level only so far; end-to-end equality-delete read parity still needs a real equality-delete table fixture.
Remote object-store, latency-injected S3, and concurrent workload behavior are not measured yet.
Memory amplification is the main production risk in the planning-heavy wins.

No external issue references.

Fan scan tasks out over a thread pool of native ArrowReaders so decode uses multiple cores instead of one, streaming batches as they complete with at most one decoded batch per shard in flight. A default batch size amortizes the per-batch GIL handoff that otherwise dominates the fan-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A limited scan previously always fell back to PyArrow. Push the limit through the native reader instead: truncate the streamed result at the limit (slicing the crossing batch and closing the shards so they stop decoding early) and cap the batch size to the limit so a small limit does not decode a full batch per shard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The native reader emits arrow-rs's own types (string rather than large_string, and run-end-encoded identity-partition columns), so its output diverged from the PyArrow scan path. Cast every batch to schema_to_pyarrow(projected_schema), decoding run-end-encoded columns first since there is no direct cast kernel for them, so the native path is a faithful drop-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PYICEBERG_RUST_SCAN_PLANNING plans the scan in pyiceberg-core (Table.plan_files) instead of PyIceberg's Python manifest planning, then streams the planned tasks through the same sharded, casted reader as the read path. Falls back to PyArrow on any scan pyiceberg-core cannot handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The native path crashed on a normal REST+S3 catalog instead of degrading: non-str FileIO props (the REST auth manager) reached the Rust binding, S3 path-style/region were never translated to the opendal keys, and the fallback guard missed decode-time errors and schemeless local paths. Filter props to strings, mirror PyArrow's path-style default and pass region, prime the first batch so a decode mismatch falls back too, and skip native for bare paths. Scoped to static-credential FileIO -- a refreshing auth manager is still not carried to the native path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

plan_files() now resolves a ScanPlanner -- local manifest planning or REST server-side, exactly as before -- or uses one injected on the scan. That injection point is the seam a Rust or engine-specific planner plugs into without touching any call site. The bundled RustScanPlanner is an opt-in, deliberately-unimplemented stub: the native plan output exposes only booleans/counts, not the residual, partition data, or deletes needed to rebuild a faithful FileScanTask, so the fused native read path stays the supported native route. Env flags and default resolution are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

abnobdoss mentioned this pull request May 25, 2026

superseded: native partition-aware scan draft #3

Closed

Abanoub Doss and others added 9 commits June 6, 2026 11:44

feat(io): add pyiceberg-core scan adapters

f8a5677

feat(table): add opt-in pyiceberg-core arrow reader

dc0a7af

feat(table): route eager native scans through batch reader

deaa353

abnobdoss force-pushed the aba-158-opt-in-rust-arrow-scan branch from 60a6ee3 to deaa353 Compare June 6, 2026 16:47

abnobdoss changed the base branch from aba-156-157-core-adapters to main June 6, 2026 17:02

feat(table): wire native scan product paths

1ec08ff

abnobdoss changed the title ~~table: add opt-in pyiceberg-core arrow reader~~ table: native scan and Arrow stream integration Jun 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

table: native scan and Arrow stream integration#2

table: native scan and Arrow stream integration#2
abnobdoss wants to merge 10 commits into
mainfrom
aba-158-opt-in-rust-arrow-scan

abnobdoss commented May 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abnobdoss commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abnobdoss commented May 25, 2026 •

edited

Loading